Poster: Kokonor Date: Nov 19, 2013 2:46am
Forum: texts Subject: dead links / downloading pdf files

Thank you. As I understand it, the Wayback Machine did not save copies of the .pdf files I am looking for. I would be grateful if you could suggest some other service or program that crawls the web and makes copies available.

Sincerely,

Koknor

Poster: aibek Date: Nov 19, 2013 3:40am
Forum: texts Subject: Re: dead links / downloading pdf files

Hello

I found one more:
http://web.archive.org/20091229085355/http://old.thdl.org:80/community/pdfs/honriwaterrep.pdf

1) I was thinking of something like this:
https://chrome.google.com/webstore/detail/web-cache/coblegoildgpecccijneplifmeghcgip

I tried it for a couple of the community/pdfs/ files, but I could find nothing.

2) You could try searching for just the name of the pdf file. This may turn up useful results. For example, searching Google for ‘honriwaterrep.pdf’ led me to the following page
http://comments.gmane.org/gmane.education.english.teflchina.jobs/4234
where I found another URL for the file, and the Wayback Machine had a copy at that URL (the one quoted at the top of this post).

(Please note that there is no point in trying the Wayback Machine for other archives of old.thdl.org. I have already checked, and the Wayback Machine has archived only the above-mentioned file from old.thdl.org/community/pdfs/.)

3) Another thing you could try is the following search. The sites that host the pages in its results may have a “local copy” of the pdfs you are looking for.
https://www.google.com/search?q=thdl.org%2Fcommunity%2Fpdfs%2F
For example, the top result is the LukeWater file. You could visit that page in the hope that it has something useful: perhaps an alternate link, a “local copy”, or just an extracted text copy.

4) Finally, you could try writing to relevant mailing lists and the like, asking people to check whether they have the desired files. Also, if you still have the hard drives that held the files, you could try recovering them from there.
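
Suggestion 2 above can be partly automated. The following is only a rough Python sketch: the function names pdf_filename and filename_search_url are made up for illustration, and the search endpoint is simply the Google URL already used in suggestion 3. It pulls the bare file name out of a dead link and builds a search URL for that name in quotes.

```python
import posixpath
from urllib.parse import quote_plus, urlparse


def pdf_filename(dead_url):
    """Return the last path component of a URL, e.g. 'honriwaterrep.pdf'."""
    return posixpath.basename(urlparse(dead_url).path)


def filename_search_url(dead_url):
    """Build a Google search URL for the quoted file name (hypothetical helper)."""
    name = pdf_filename(dead_url)
    # The double quotes ask the search engine for the exact file name.
    return "https://www.google.com/search?q=" + quote_plus('"' + name + '"')


if __name__ == "__main__":
    print(filename_search_url("http://old.thdl.org/community/pdfs/honriwaterrep.pdf"))
```

Opening the printed URL in a browser is then the same as typing the file name into the search box by hand.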

Do let me know how the search progresses! Just reply to any post by me and I will get an email notification.

---
General stuff:
https://en.wikipedia.org/wiki/Wikipedia:Link_rot#Repairing_a_dead_link

This post was modified by aibek on 2013-11-19 11:40:21

Poster: aibek Date: Nov 19, 2013 5:18am
Forum: texts Subject: Re: dead links / downloading pdf files

By the way, the following is how I have been searching for the files.

The documentation:
https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server

Please do CDX queries responsibly. (These are highly resource intensive.)

All the files archived from www.thdl.org/community/pdfs/:
http://web.archive.org/cdx/search/cdx?url=http://www.thdl.org/community/pdfs/&matchType=prefix&limit=1000&output=json
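
If you would rather not type such query URLs by hand, here is a minimal Python sketch of how one could be assembled. The function name build_cdx_query and its default values are my own for illustration; only the endpoint and the parameters url, matchType, limit, output and filter are taken from the queries in this post. The sketch only builds the URL and does not contact the server.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def build_cdx_query(url, match_type="prefix", limit=1000, filters=()):
    """Assemble a CDX query URL (hypothetical helper, no network access)."""
    params = [
        ("url", url),
        ("matchType", match_type),
        ("limit", str(limit)),
        ("output", "json"),
    ]
    # Each filter is a 'field:regex' string and becomes its own filter= parameter.
    params += [("filter", f) for f in filters]
    # Keep ':' and '/' readable; everything else is percent-encoded as needed.
    return CDX_ENDPOINT + "?" + urlencode(params, safe=":/")


if __name__ == "__main__":
    print(build_cdx_query("http://www.thdl.org/community/pdfs/"))
```

Keeping limit as an explicit argument makes it harder to forget, which matters given how resource intensive these queries are.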

You can construct the Wayback Machine links from the returned fields with the following formula (using the timestamp and original values without their double quotes, of course):
web.archive.org/timestamp/original
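
The formula above can be applied to a whole result at once. This is only a sketch under one assumption: that with output=json the first row of the result is a header naming the fields and each later row is one capture. The function name wayback_links is made up, and the digest and length values in the sample are placeholders, not real data. Only the timestamp and URL in the sample come from the link quoted earlier in this thread.

```python
import json


def wayback_links(cdx_json_text):
    """Turn CDX JSON output into web.archive.org/timestamp/original links."""
    if not cdx_json_text.strip():
        return []
    rows = json.loads(cdx_json_text)
    if not rows:
        return []
    header, records = rows[0], rows[1:]
    ts = header.index("timestamp")
    orig = header.index("original")
    return ["http://web.archive.org/" + r[ts] + "/" + r[orig] for r in records]


# Sample input: digest and length are placeholders for illustration only.
SAMPLE = json.dumps([
    ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
    ["org,thdl,old)/community/pdfs/honriwaterrep.pdf", "20091229085355",
     "http://old.thdl.org:80/community/pdfs/honriwaterrep.pdf",
     "application/pdf", "200", "PLACEHOLDER", "0"],
])

if __name__ == "__main__":
    for link in wayback_links(SAMPLE):
        print(link)
```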


All the files archived from old.thdl.org/community/pdfs/:
http://web.archive.org/cdx/search/cdx?url=http://old.thdl.org/community/pdfs/&matchType=prefix&limit=1000&output=json

The field ‘original’ contains the full url, so using regex I am checking if a file called ‘schoolrep.pdf’ is found at any location on thdl.org:
http://web.archive.org/cdx/search/cdx?url=thdl.org&matchType=host&output=json&limit=50&filter=original:.*schoolrep.pdf
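
To see what such a pattern accepts before sending it to the server, it can be tried locally. The sketch below uses Python's re module as a stand-in and assumes that the server matches the regex against the whole field. I have not verified how the server applies it, and its regex dialect may differ in details. The function name matches_filter, the stricter pattern and the second test URL are made up for illustration.

```python
import re


def matches_filter(pattern, field_value):
    """Return True if the regex matches the entire field value."""
    return re.fullmatch(pattern, field_value) is not None


# Pattern as used in the query above: the unescaped '.' matches any character.
LOOSE = r".*schoolrep.pdf"
# A stricter variant (my own suggestion): literal dot, file name after a slash.
STRICT = r".*/schoolrep\.pdf"

if __name__ == "__main__":
    wanted = "http://www.thdl.org/community/pdfs/schoolrep.pdf"
    lookalike = "http://www.thdl.org/x/oldschoolrep-pdf"  # hypothetical URL
    for pattern in (LOOSE, STRICT):
        print(pattern, matches_filter(pattern, wanted), matches_filter(pattern, lookalike))
```

For a file name this specific the loose pattern is unlikely to cause trouble, but escaping the dot is the safer habit.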

Using regex, I search for all pdf files whose ‘length’ field has exactly seven characters (the seven dots), that is, a size of roughly 1MB to 10MB:
http://web.archive.org/cdx/search/cdx?url=thdl.org&matchType=host&output=json&limit=50&filter=length:.......&filter=mimetype:application/pdf

(Note that in the above, limit is set to 50, so it returns only the first 50 entries. No point in wasting IA resources by asking it to run pointless errands!)

Note also that the ‘length’ field does not exactly correspond to the file size. I don’t know exactly what it measures, but I know that it is always approximately equal to the size of the file that IA has.
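
The seven-dot length filter works because a seven-digit number lies between 1,000,000 and 9,999,999. Here is a small local check in Python, again assuming that the regex has to match the whole field. The name length_matches is made up for illustration. Since ‘length’ is only approximately the file size, the 1MB and 10MB boundaries are approximate too.

```python
import re

# Seven dots: any seven characters, i.e. a seven-digit length value.
LENGTH_FILTER = "......."


def length_matches(length_field, pattern=LENGTH_FILTER):
    """Return True if the whole length field matches the pattern."""
    return re.fullmatch(pattern, length_field) is not None


if __name__ == "__main__":
    for value in ("999999", "1000000", "9999999", "10000000"):
        print(value, length_matches(value))
```

Adding or removing dots shifts the range by a factor of ten, so six dots would cover roughly 100KB to 1MB.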

This post was modified by aibek on 2013-11-19 13:18:55